1 Introduction

Determining the value of a home is becoming increasingly mechanized. With the maturation of statistical modeling techniques and machine learning, a process that was once driven largely by negotiations between individual actors—and the hard-to-measure factor of human taste—can now be abstracted by computers. The purpose of this project is to do just that: to devise a machine learning model that can accurately predict the sale price of a home based on both intrinsic and environmental factors.

Before discussing the techniques used to design this model, it is important to emphasize what, and whom, such a model benefits. We do not believe that machines should learn merely for learning's sake. A model that accurately predicts home prices, however, stands to benefit many parties. The most obvious are home buyers and sellers, who can use the model as a benchmark, obviating much of the needless back-and-forth that characterizes real estate negotiations.

But there are others who would be better off, too. Neighbors ought to have a sense of their local real estate market given that the value of one home tends to influence that of the next. Local governments, agencies, and other public service providers would also be beneficiaries since they need an accurate measure of the local economy to craft good policy. Finally, foreign investors and businesses that funnel private capital and create jobs in cities are not usually equipped with the local intelligence necessary to make investment decisions, relying inefficiently on the word of locals. An automated valuation model can help to give them that information easily.

The model used in this project is what is known as a hedonic model. Not to be confused with the ancient Greek school of philosophy, a hedonic model is a predictive model that decomposes a price into the contributions of a variety of discrete factors and recombines them into a final prediction. Here, our hedonic model of home prices takes input factors from three primary categories: (1) the physical attributes of each property, (2) nearby public amenities or disamenities, and (3) the clustering of home prices in physical space (known in the real estate industry as “comparables” or “comps”). A detailed list of the specific factors used in our model follows in the Data section below.

The accuracy of our results can be assessed using a host of different metrics, but the most salient is the R² value, which measures how much of the variation in price is explained by the model. The final model returned an R² of approximately 0.49, indicating that just under half of the variation in sale price was explained.
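As a point of reference, R² has a standard definition (not specific to this project): for observed prices $y_i$, predicted prices $\hat{y}_i$, and the mean observed price $\bar{y}$,

```latex
R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}
```

A value of 0.49 therefore means the model's squared errors are roughly half the size of those produced by simply predicting the mean price for every home.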

2 Data

Every piece of data fed into the model falls under one of the three categories stated above (see Introduction). Although the bulk of the data on each home’s internal attributes was provided in the base dataset (square footage, number of rooms, and so on), data from the second and third categories—that is, nearby amenities or disamenities and spatial patterns—needed to be sourced externally.

Public data sources that were used to do so include the U.S. Census Bureau for demographic information, the State of Colorado’s open data portal for the locations of schools, and Boulder County’s open data portal for local ZIP codes and points of interest for recreation such as trailheads.

2.1 Regression Summary Statistics

The summary statistics of the full model are presented below.

Dependent variable: price

Variable                                 Coefficient (Std. Error)
med_HH_Income                            -1.925*** (0.497)
pct.over75K                              1,073,229.000*** (150,718.600)
pct.Information                          -2,203,228.000*** (360,495.200)
pct.Finance                              1,594.924 (262,495.100)
pct.Professional                         -181,448.900 (152,270.900)
pct.Ed_Health                            -446,681.800*** (135,859.400)
nbrBedRoom                               8,919.376* (5,277.324)
nbrFullBaths                             -20,167.680*** (6,382.267)
TotalFinishedSF                          159.369*** (8.851)
AcDscrEvaporative Cooler                 52,784.020 (100,480.700)
AcDscrNo AC                              53,806.290 (96,560.950)
AcDscrWhole House                        63,817.470 (96,562.460)
Age                                      1,225.806*** (226.321)
schools_nn3                              -32.421*** (4.985)
trailheads_nn5                           -7.301 (5.761)
dist_FR                                  -14.970*** (2.302)
qualityCodeDscrAVERAGE +                 -32,314.160* (18,231.170)
qualityCodeDscrAVERAGE ++                31,199.560* (18,721.230)
qualityCodeDscrEXCELLENT                 1,209,125.000*** (42,347.180)
qualityCodeDscrEXCELLENT +               1,501,140.000*** (94,512.870)
qualityCodeDscrEXCELLENT++               2,053,472.000*** (74,958.720)
qualityCodeDscrEXCEPTIONAL 1             1,178,518.000*** (94,737.590)
qualityCodeDscrEXCEPTIONAL 2             1,989,543.000*** (250,755.900)
qualityCodeDscrFAIR                      -74,775.010 (45,893.580)
qualityCodeDscrGOOD                      70,174.540*** (13,001.650)
qualityCodeDscrGOOD +                    109,881.300*** (21,877.160)
qualityCodeDscrGOOD ++                   205,970.400*** (19,933.570)
qualityCodeDscrLOW                       -124,112.600 (97,682.940)
qualityCodeDscrVERY GOOD                 300,628.500*** (21,468.180)
qualityCodeDscrVERY GOOD +               617,758.800*** (37,791.240)
qualityCodeDscrVERY GOOD ++              681,368.300*** (30,880.430)
designCodeDscr2-3 Story                  -27,609.520** (11,101.130)
designCodeDscrBi-level                   49,911.570** (25,385.950)
designCodeDscrMULTI STORY- TOWNHOUSE     -132,319.300*** (15,917.080)
designCodeDscrSplit-level                19,440.710 (16,834.630)
ZipCode80025                             -364,412.500 (283,330.700)
ZipCode80026                             -283,686.300 (216,120.600)
ZipCode80027                             -260,814.200 (216,935.800)
ZipCode80301                             -163,691.500 (217,934.200)
ZipCode80302                             156,545.000 (219,405.000)
ZipCode80303                             -105,060.900 (218,506.400)
ZipCode80304                             143,416.500 (219,907.900)
ZipCode80305                             -79,906.970 (219,864.600)
ZipCode80403                             -127,448.300 (227,619.200)
ZipCode80422                             -37,943.930 (264,007.600)
ZipCode80455                             -215,880.600 (232,920.400)
ZipCode80466                             -224,580.500 (217,922.700)
ZipCode80471                             -382,909.300 (490,554.700)
ZipCode80481                             -65,293.540 (225,752.000)
ZipCode80501                             -372,582.700* (216,143.100)
ZipCode80503                             -423,489.700* (216,770.900)
ZipCode80504                             -385,146.600* (216,204.100)
ZipCode80510                             290,204.600 (235,334.200)
ZipCode80516                             -406,437.100* (216,585.600)
ZipCode80540                             -309,398.300 (222,118.100)
ZipCode80544                             -393,245.800 (305,797.900)
Constant                                 861,197.800*** (245,352.800)

Observations                             11,252
R²                                       0.518
Adjusted R²                              0.516
Residual Std. Error                      428,929.700 (df = 11195)
F Statistic                              214.978*** (df = 56; 11195)
Note: *p<0.1; **p<0.05; ***p<0.01

2.2 Correlation Matrix

The correlation matrix below depicts how related each feature is to the others. For instance, a home’s distance from the Front Range appears to be negatively correlated with price; that is, closeness to the Front Range corresponds to higher prices. (Importantly, correlation is distinct from causation; the matrix serves only as a helpful guide for selecting relevant factors for the model.)

numericVars <- 
  select_if(st_drop_geometry(boulder.sf), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +   
    labs(title = "Correlation across numeric variables",
         caption = "Figure 1.1")

2.3 Variable Correlation Scatterplots

Scatterplots are another means of showing correlation. The four features shown here are (moving clockwise beginning from the top left): (1) percentage of the population with income greater than $75,000 per year, (2) percentage of the population associated with professional or management services, (3) the average of the distances to the nearest five trailheads, and (4) the average of the distances to the nearest three schools.

The first two demographic features are positively correlated with price: a greater share of the population with higher incomes and in professional services corresponds to higher prices. Conversely, homes that are farther from trailheads and schools tend to fetch lower prices.

st_drop_geometry(boulder.sf) %>%
  dplyr::select(price, pct.over75K, pct.Professional, schools_nn3, trailheads_nn5) %>%
  filter(price <= 4000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables",
          caption = "Figure 1.2") +
     plotTheme()

2.4 Spatial Distribution of Home Sale Prices in Boulder County

Although homes sell for a wide range of prices across Boulder County, the prices they fetch tend to cluster in space. The following map demonstrates this phenomenon.

# Home sale prices by quintile
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = boulder.sf, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder.sf,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Sale Prices, Boulder County",
       caption = "Figure 1.3") +
  mapTheme()

2.5 Mapping Independent Variables

Just as the model’s outcome variable, home sale price, can be depicted on the map, so too can the features that the model will use to predict that outcome. Three of these features are mapped here.

First, the distance between each home and the Front Range is color-coded below, with the Front Range itself included for reference.

# Distance from the Front Range
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = FrontRange, colour = "#2d6a4f", size = 2) +
  geom_sf(data = boulder_homes_observed, aes(colour = q5(dist_FR)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder_homes,"dist_FR"),
                   name="Quintile\nBreaks") +
  labs(title="Home distance from Front Range",
       caption = "Figure 1.4") +
  mapTheme()

One of the U.S. Census variables, median household income, is mapped below by Census tract.

# Median Household Income in Boulder County
ggplot() + 
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = acsTractsBoulder.2019.sf, aes(fill = med_HH_Income)) +
  scale_fill_viridis_b() +
  labs(title = "Median Household Income in Boulder County",
       subtitle = "by Census Tract",
       caption = "Figure 1.5") +
  mapTheme()

Finally, the distribution of schools, both public and private, in Boulder County is depicted below.

# Schools in Boulder County
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = boulder_schools, colour = "#2d6a4f") +
  labs(title = "Schools in Boulder County",
       caption = "Figure 1.6") +
  mapTheme()

Finally, home sale prices are shown against ZIP code boundaries, previewing the spatial grouping used in the model.

# Sale prices + ZIP code areas
ggplot() +
  geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
  geom_sf(data = boulder_zips, fill = NA, colour = "#55286F") +
  geom_sf(data = boulder_homes_observed, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder_homes,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Sale Prices + Zip Code Areas",
       caption = "Figure 1.7") +
  mapTheme()

3 Methods

The statistical model used here is a linear regression fit by ordinary least squares (OLS). This type of model expresses the outcome as a weighted sum of the input features, choosing the weights that best fit the observed data. In this case, the model was fit to the various data features described above, factoring in each home’s internal and environmental characteristics to predict price.
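In standard notation (a generic statement of OLS, not specific to this project), each sale price is modeled as a linear combination of its features $x_{ik}$, and the coefficients $\beta$ are chosen to minimize the sum of squared residuals:

```latex
\text{price}_i = \beta_0 + \sum_{k} \beta_k x_{ik} + \varepsilon_i,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i} \Big( \text{price}_i - \beta_0 - \sum_{k} \beta_k x_{ik} \Big)^2
```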

The strength of a prediction model can be distilled down to two interrelated qualities: accuracy and generalizability.

Accuracy refers to the ability of a model to produce predicted values that are as close as possible to the actual observed values. To test the accuracy of the model here, the original data was divided into two groups. One group, the training set, represented 75% of the original data and was used to create, or “train,” the regression model. The second group, the test set, represented the remaining 25% of the data and was used to measure how close, or not, the predictions came to the corresponding observed values. Accuracy can be measured by the metrics of “mean absolute error” (MAE) and “mean absolute percent error” (MAPE).
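For a test set of $n$ homes with observed prices $y_i$ and predicted prices $\hat{y}_i$, these two metrics have the standard definitions:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{y_i}
```

MAE is expressed in dollars, while MAPE is a scale-free percentage, which makes it easier to compare errors across cheap and expensive homes.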

Generalizability refers to the ability of a model to make accurate predictions on new, unseen data. The generalizability of the model here was tested using k-fold cross-validation: the dataset was divided into 100 folds, and the model was trained 100 times, each time holding out a different fold as the test set. Errors across all 100 folds were then averaged to capture how well the model performs on data it has yet to encounter.

4 Results

The following summary table presents the results of the linear regression on the training data set.

Dependent variable: price

Variable                                 Coefficient (Std. Error)
med_HH_Income                            -2.244*** (0.608)
pct.over75K                              1,166,966.000*** (184,708.600)
pct.Information                          -2,544,788.000*** (437,167.000)
pct.Finance                              15,107.560 (318,801.800)
pct.Professional                         -177,234.300 (185,614.800)
pct.Ed_Health                            -447,466.400*** (164,602.700)
nbrBedRoom                               10,584.790 (6,465.900)
nbrFullBaths                             -26,951.370*** (7,701.113)
TotalFinishedSF                          162.138*** (10.635)
AcDscrEvaporative Cooler                 61,462.930 (108,280.800)
AcDscrNo AC                              59,295.340 (103,736.400)
AcDscrWhole House                        60,411.740 (103,734.600)
Age                                      936.334*** (274.080)
schools_nn3                              -29.274*** (5.997)
trailheads_nn5                           -10.707 (6.929)
dist_FR                                  -13.671*** (2.785)
qualityCodeDscrAVERAGE +                 -36,815.380* (21,834.160)
qualityCodeDscrAVERAGE ++                20,464.010 (22,297.830)
qualityCodeDscrEXCELLENT                 1,210,563.000*** (50,626.680)
qualityCodeDscrEXCELLENT +               1,427,054.000*** (109,284.500)
qualityCodeDscrEXCELLENT++               1,861,940.000*** (85,542.050)
qualityCodeDscrEXCEPTIONAL 1             1,149,468.000*** (107,483.000)
qualityCodeDscrEXCEPTIONAL 2             1,990,171.000*** (269,800.200)
qualityCodeDscrFAIR                      -98,568.580* (53,414.200)
qualityCodeDscrGOOD                      60,064.260*** (15,832.510)
qualityCodeDscrGOOD +                    103,969.300*** (26,379.780)
qualityCodeDscrGOOD ++                   198,980.300*** (24,016.660)
qualityCodeDscrLOW                       -138,747.500 (110,655.200)
qualityCodeDscrVERY GOOD                 291,709.100*** (25,939.220)
qualityCodeDscrVERY GOOD +               608,531.300*** (44,908.030)
qualityCodeDscrVERY GOOD ++              672,810.800*** (36,828.180)
designCodeDscr2-3 Story                  -24,415.950* (13,495.780)
designCodeDscrBi-level                   48,895.660 (29,870.130)
designCodeDscrMULTI STORY- TOWNHOUSE     -127,779.600*** (19,323.430)
designCodeDscrSplit-level                16,070.850 (20,085.200)
ZipCode80025                             -312,810.800 (305,551.100)
ZipCode80026                             -279,657.800 (232,256.200)
ZipCode80027                             -244,639.600 (233,343.600)
ZipCode80301                             -125,181.200 (234,656.900)
ZipCode80302                             191,607.800 (236,717.100)
ZipCode80303                             -65,854.520 (235,418.600)
ZipCode80304                             160,602.200 (237,354.500)
ZipCode80305                             -53,060.890 (237,285.000)
ZipCode80403                             -137,589.300 (247,045.000)
ZipCode80422                             -42,427.790 (290,110.600)
ZipCode80455                             -199,819.200 (251,932.100)
ZipCode80466                             -231,143.700 (234,636.400)
ZipCode80471                             -363,424.400 (527,629.800)
ZipCode80481                             -61,651.600 (244,252.300)
ZipCode80501                             -365,969.700 (232,286.000)
ZipCode80503                             -420,085.000* (233,137.300)
ZipCode80504                             -376,917.600 (232,378.900)
ZipCode80510                             258,247.400 (257,187.400)
ZipCode80516                             -404,424.200* (232,919.100)
ZipCode80540                             -303,798.100 (240,025.400)
ZipCode80544                             -390,646.000 (328,581.000)
Constant                                 869,273.200*** (266,092.900)

Observations                             8,793
R²                                       0.492
Adjusted R²                              0.488
Residual Std. Error                      459,982.400 (df = 8736)
F Statistic                              150.929*** (df = 56; 8736)
Note: *p<0.1; **p<0.05; ***p<0.01

4.1 Test Set Results

A summary of the mean absolute error (MAE) and mean absolute percentage error (MAPE) for price predictions on the test set is shown below.

Graphical representations of the test set results follow.

# histogram of absolute errors
ggplot(boulder.test, aes(x = price.abserror)) +
  geom_histogram(binwidth=10000, fill = "green", colour = "white") +
  scale_x_continuous(limits = c(0, 1000000)) +
  labs(title = "Distribution of prediction errors for single test",
       x = "Sale Price Absolute Error", y = "Count") +
  plotTheme()

4.2 Cross Validation Results

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(boulder.sf), 
     method = "lm", trControl = fitControl, na.action = na.pass)

K-fold cross-validation with 100 folds is used to explore the generalizability of this model. A histogram of the mean absolute error (MAE) across the 100 folds is shown below.

# histogram of cross validation MAE
mae <- data.frame(reg.cv$resample[,3]) %>%
  rename(mae = reg.cv.resample...3.)

ggplot(mae, aes(x = mae)) +
  geom_histogram(binwidth=10000, fill = "orange", colour = "white") +
  scale_x_continuous(labels = c(0, 100000, 200000, 300000, 400000, 500000), 
                     limits = c(0, 500000)) +
  labs(title = "Distribution of MAE",
       subtitle = "k-fold cross validation; k = 100",
       x = "Mean Absolute Error", y = "Count") +
  plotTheme()

The prices predicted for the test set are plotted against the actual sale prices for the test set in the figure below.

ggplot(boulder.test) +
  geom_point(aes(price, price.predict)) +
  geom_smooth(aes(price, price), colour = "orange") +
  geom_smooth(method = "lm", aes(price, price.predict), se = FALSE, colour = "green") +
  labs(title = "Predicted sale price as a function of observed price",
       subtitle = "Orange line represents a perfect prediction; Green line represents prediction",
       x = "Observed Sale Price", y = "Predicted Sale Price") +
  plotTheme()

Residual absolute errors for the test set are mapped onto Boulder County below.

ggplot() +
  geom_sf(data = boulder_boundary, fill = "grey") +
  geom_sf(data = boulder.test, aes(colour = q5(price.abserror)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder.test,"price.abserror"),
                   name="Quintile\nBreaks") +
  labs(title="Test set absolute price errors",
       caption = "Figure X.X") +
  mapTheme()

Because real estate is geographically clustered, errors in price predictions also tend to cluster in space. One way to measure this is the “spatial lag” of an error: the average error among a home’s nearest neighboring sales. The following plot depicts each test-set error as a function of the spatial lag of errors, computed over the five nearest neighbors.
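Concretely, with the row-standardized weights used in the code below (`knearneigh(…, 5)` with `style = "W"`), the spatial lag of the prediction error $e_i$ is simply the mean error among home $i$'s five nearest neighbors:

```latex
\mathrm{lag}(e_i) = \sum_{j} w_{ij}\, e_j,
\qquad
w_{ij} =
\begin{cases}
\tfrac{1}{5} & \text{if } j \text{ is one of the five nearest neighbors of } i \\
0 & \text{otherwise}
\end{cases}
```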

coords.test <- st_coordinates(boulder.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

boulder.test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.error)) %>%
  ggplot(aes(lagPriceError, price.error)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     labs(title = "Error as a function of the spatial lag of price errors") +
     plotTheme()

The clustering effect of home prices—the technical term is “spatial autocorrelation”—can also be demonstrated with the Moran’s I statistic. A Moran’s I near positive 1 indicates clustering, whereas a value near 0 indicates a spatially random distribution. In the figure below, the observed Moran’s I of the prediction errors, depicted in orange, is contrasted with the distribution obtained from 999 random permutations, showing that errors in Boulder do indeed cluster in space.
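For reference, Moran's I has a standard definition; computed here over $n$ prediction errors $e_i$ with spatial weights $w_{ij}$ and mean error $\bar{e}$:

```latex
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot
\frac{\sum_{i}\sum_{j} w_{ij} \left( e_i - \bar{e} \right)\left( e_j - \bar{e} \right)}
     {\sum_{i} \left( e_i - \bar{e} \right)^2}
```

The permutation test shuffles the errors among locations 999 times and recomputes I each time, so the orange line falling outside that null distribution indicates significant clustering.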

moranTest <- moran.mc(boulder.test$price.error,
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

Finally, predictions for every home in the dataset are mapped below.

allPredictions <- boulder_homes %>%
  mutate(predictions = predict(reg1, boulder_homes)) %>%
  dplyr::select(predictions)

ggplot() +
  geom_sf(data = boulder_boundary, fill = "grey") +
  geom_sf(data = allPredictions, aes(colour = q5(predictions)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(allPredictions,"predictions"),
                   name="Quintile\nBreaks") +
  labs(title="Predictions for all homes in the dataset, Boulder County",
       caption = "Figure X.X") +
  mapTheme()

Mean absolute percentage error also varies considerably by ZIP code, and one ZIP code stands out with an extreme error rate, as the map below shows.

st_drop_geometry(boulder.test) %>%
  group_by(ZipCode) %>%
  summarize(mean.MAPE = mean(price.ape, na.rm = T)) %>%
  ungroup() %>% 
  left_join(boulder_zips) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = boulder.test, colour = "black", size = .5) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by Zip Code") +
      mapTheme()

To examine whether errors track price levels, mean MAPE by ZIP code is plotted against mean sale price by ZIP code.

testError_by_zips <-
left_join(
  st_drop_geometry(boulder.test) %>%
    group_by(ZipCode) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
  st_drop_geometry(boulder.test) %>%
    group_by(ZipCode) %>%
    summarize(MAPE = mean(price.ape)))
testError_by_zips %>%
  kable() %>% kable_styling()
ZipCode meanPrice MAPE
80026 635384.3 0.1809021
80027 722815.3 0.1935969
80301 828126.3 0.1925919
80302 1239133.0 0.3064021
80303 854227.0 0.2154622
80304 1432077.5 0.2781936
80305 917100.0 0.2027842
80403 425888.9 0.1832423
80422 535000.0 0.1743771
80455 618750.0 0.1385517
80466 550730.6 0.2435732
80481 408071.4 0.3302264
80501 434675.0 0.1999699
80503 696281.1 0.1877800
80504 510797.5 0.1793604
80510 268088.9 0.3440662
80516 626238.3 0.2524920
80540 484995.0 1.7348011
ggplot(testError_by_zips) +
  geom_point(aes(meanPrice, MAPE)) +
  geom_smooth(method = "lm", aes(meanPrice, MAPE), se = FALSE, colour = "green") +
  labs(title = "MAPE by Zip Code as a function of mean price by Zip Code",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme()

4.3 Generalizability

The model’s generalizability can also be evaluated against contextual factors on the map. Below, two Census measures are depicted for Boulder County: racial composition (majority white vs. majority non-white tracts) and income level.

That Boulder County is relatively racially homogeneous would seem to indicate that the model is fairly generalizable on that score. Variations in income, by contrast, might present an obstacle to generalizability.

(The greatest challenge to the model’s generalizability comes in the urban-rural distinction, however, as discussed later in this project.)

boulder_tracts19 <- 
  get_acs(geography = "tract", year = 2019, 
          variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          geometry = TRUE, state = "CO", county = "Boulder", output = "wide") %>%
  st_transform('ESRI:102254')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(boulder_tracts19), 
  aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(boulder_tracts19), 
  aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), 
    name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + 
    theme(legend.position="bottom"))

5 Discussion

The above results suggest that the model was effective in some respects and deficient in others. In sum, the model explained just under 50% of the variation in prices. The contributions of individual features varied widely. The two that clearly outperformed the rest were the distances to schools and to the Front Range: both were highly statistically significant and contributed substantially to the model, particularly when considered in the aggregate.

Conversely, the ZIP code indicators were, surprisingly, not especially significant on their own. Despite the common-sense intuition about the importance of a house’s ZIP code, the results suggest that individual ZIP code coefficients were not strongly determinative of price. That said, when ZIP codes were removed during model development, overall accuracy declined markedly. This indicates that ZIP codes, while relatively insignificant on a per-property basis, are integral to the model as a whole.

Both the strength and weakness of the model can be attributed to the geography of Boulder County. The model excelled in urban areas, clustered around Boulder and other large municipalities contained within the greater county. The reason for this is likely that environmental features such as nearness to schools and recreational amenities are of greater importance for homes in the denser parts of the county, whose residents actively consider such elements when purchasing a home.

On the other hand, the model saw a sizable drop in accuracy in the rural parts of the county. Certain properties located further in the mountains were clearly sui generis, with prices that diverged significantly from the rest. The model struggled to account for these properties, likely because many of the features important in urban areas are simply inapposite in the rural context. Homebuyers who are seeking a mountain getaway house, for example, are less interested in their house’s proximity to schools.

6 Conclusion

Although it represents a fine starting point, the model in its present form would likely not be ready for deployment by Zillow. Beyond even the need to improve base metrics such as average error, the more fundamental problems identified in the Discussion section would need to be remedied before Zillow’s vast user base could rely on the model.

Thankfully, several areas for improvement can already be identified. To start, more features should be added. Data that were not included in the model but would doubtless prove useful include crime data, school districts ranked by desirability, and other features that better capture the clustering of home prices, such as neighborhood boundaries. Although these data may not be available in pre-packaged form from the open data sources used here, such obstacles can likely be overcome with clever feature engineering.

As for the urban–rural issue articulated in the previous section, one possible solution is to revise the model to predict price per square foot rather than total price. Price per square foot better isolates the effect of location on property value, since it is comparable across properties irrespective of building size. Coupled with more variables that account for the spatial clustering of prices, a price-per-square-foot target would likely distinguish between urban and rural properties better than the current model.
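A minimal sketch of this revision, using toy data invented for illustration (the variable names mirror this project’s `price` and `TotalFinishedSF`, but the values and the single predictor `dist_FR` here are hypothetical):

```r
# Toy illustration: model price per square foot, then back-transform to price.
homes <- data.frame(
  price           = c(400000, 650000, 900000, 520000),
  TotalFinishedSF = c(1600, 2200, 3000, 1900),
  dist_FR         = c(12000, 8000, 3000, 10000)
)

# New outcome variable: dollars per finished square foot
homes$price_per_sf <- homes$price / homes$TotalFinishedSF

# Fit on the per-square-foot outcome rather than total price
fit <- lm(price_per_sf ~ dist_FR, data = homes)

# Back-transform: predicted total price = predicted $/sf * square footage
homes$price.predict <- predict(fit, homes) * homes$TotalFinishedSF
```

In the full project, the same two steps (divide by `TotalFinishedSF` before fitting, multiply after predicting) would be applied to the training and test sets, so that MAE and MAPE remain comparable to the current model’s.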